Student Name: Sarah FREMOND
Project Title: To what extent is Yelp a powerful indicator? A case study in Arizona
import warnings
warnings.filterwarnings('ignore')
# usual libraries
import pandas as pd
import json
import random
from collections import Counter
import numpy as np
from scipy import stats
from pprint import pprint
# visual libraries
%matplotlib inline
import folium
from folium.plugins import HeatMap
from mpl_toolkits.mplot3d import axes3d, Axes3D
import matplotlib.pyplot as plt
from matplotlib import gridspec
import seaborn as sns
import matplotlib.font_manager as fm
from matplotlib.collections import QuadMesh
# libraries for text preprocessing
import re
#import contractions
import unicodedata
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
#libraries from scikit learn
from sklearn.metrics.pairwise import cosine_similarity, haversine_distances
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import gensim
from wordcloud import WordCloud
from textblob import TextBlob
from fastdtw import fastdtw
from numpy.linalg import norm
from scipy.sparse import hstack
from scipy.spatial.distance import cdist
The data have been downloaded from the Yelp website.
with open('./yelp_dataset/business.json', encoding="utf8") as f:
    data = f.readlines()
data_ = [json.loads(line) for line in data]  # convert each JSON string to a dict
business = pd.read_json(json.dumps(data_))  # load into a dataframe
del data
del data_
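As a side note, pandas can read newline-delimited JSON directly with `lines=True`, which avoids building the intermediate list of dicts in memory (a minimal sketch on inline data):

```python
import io
import pandas as pd

# Two tiny JSON lines standing in for business.json (hypothetical data).
jsonl = io.StringIO('{"business_id": "a", "stars": 4.0}\n'
                    '{"business_id": "b", "stars": 3.5}\n')

# lines=True parses one JSON object per line straight into a dataframe.
df = pd.read_json(jsonl, lines=True)
print(df.shape)  # (2, 2)
```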
The following file will be used later on.
%time review_arizona = pd.read_json('./yelp_dataset/review_arizona_json.json')
business.head(3)
business.describe()
comments : The business JSON file downloaded from the Yelp website contains basic information about the businesses registered on the Yelp platform. We mostly have spatial information from the coordinates, address, state and postal_code. We also get meaningful information from the Yelp overall score and the number of reviews. On average, businesses get 33 reviews and an overall star rating of 3.58 out of 5.
business[business.duplicated(['business_id'])]
comments: There are no duplicates based on the business id. The Yelp dataset therefore contains 192,098 registered businesses.
business.isnull().sum()
comments : Some missing values are found in the dataframe, all in the attributes, categories and hours features. The hours feature is not needed in this study, so it is removed. Categories are lists of keywords describing the type of business and are used later, hence observations with NaN values in categories are removed. The attributes feature will be used at the end of the analysis, so it is left untouched for now.
business = business.drop(['hours'], axis=1) #column hours is removed
business.dropna(subset=['categories'], inplace=True) #rows with nan values in categories are removed.
Since longitude and latitude carry the spatial information, address, postal_code and city are removed. State is kept for some of the analysis.
business = business.drop(['postal_code', 'address', 'city'], axis=1) # remove postal_code, address, city
State information is useful for the first analysis. Abbreviations are changed into full names to make the states easier to identify.
print(np.unique(business.state))
# we map the state abbreviations to more meaningful full names:
states = {
'AB': 'alberta',
'AK' : 'alaska',
'AL': 'alabama',
'AR': 'arkansas',
'AZ' : 'arizona',
'BAS' : 'ontario',
'BC': 'ontario',
'CA': 'california',
'CON': 'north_carolina',
'CT' : 'connecticut',
'DOW' : 'DOW',
'DUR' : 'DUR',
'FL': 'florida',
'GA' : 'georgia',
'IL': 'illinois',
'NC': 'north_carolina',
'NE': 'nebraska',
'NJ' : 'new_jersey',
'NM': 'new_mexico',
'NV': 'nevada',
'NY': 'new_york',
'OH': 'ohio',
'ON' : 'ontario',
'PA': 'pennsylvania',
'QC': 'quebec',
'SC': 'south_carolina',
'TN' : 'tennessee',
'TX': 'texas',
'UT' :'utah',
'VA': 'virginia',
'VT': 'vermont',
'WA': 'washington',
'WI': 'wisconsin',
'XGL': 'XGL',
'XGM': 'wisconsin',
'XWY': 'XWY',
}
state_val = [states[val] for val in business.state]
business = business.assign(state=state_val)
What about the attributes feature?
Here we print all unique keys of attributes, because some of them will be used later in the analysis.
attributes = []
for my_dict in business.attributes.values:
    if my_dict:
        for key in my_dict:
            attributes.append(key)
print(np.unique(attributes))
The ratio of stars per number of reviews is added at this point and will be used throughout the study. More features will be created later on as insights emerge.
business['ratio_stars_review'] = business['stars'] / business['review_count']
As can be seen in the dataframe, the feature named 'categories' contains lists of keywords describing the type of business. In this part, we will try to make this feature clearer and understand the types of businesses.
# here we check the number of different unique categories
unique_categories = []
for cat_company in business.categories.values:
    all_cat_company = cat_company.split(',')
    for single_cat in all_cat_company:
        unique_categories.append(single_cat.replace('&', '').replace('/', ' ').strip().lower())
unique_categories = np.unique(unique_categories)
print(len(unique_categories))
print(unique_categories)
comments: Yelp uses about 1,300 different categories, so it is difficult to separate them into group types manually. We therefore use an NLP method together with clustering to group businesses efficiently by type. To do so, we use the Google pre-trained Word2Vec model and the k-means algorithm.
Two solutions are possible.
The first solution runs into memory issues because we have 192,000 rows, hence we use the second one.
# loading the Google pre-trained model
%time w2v = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
# embedding unique categories
SIZE = 300  # the Word2Vec model has an embedding size of 300
matrix_company_categories_embedded = np.zeros((len(unique_categories), SIZE))
for index, category in enumerate(unique_categories):
    if category in w2v.vocab:
        matrix_company_categories_embedded[index, :] = w2v[category]
    else:
        category = category.replace(' ', '_')
        if category in w2v.vocab:
            matrix_company_categories_embedded[index, :] = w2v[category]
        else:
            # fall back to averaging the vectors of the individual words
            category = category.split('_')
            count = 0
            vector = np.zeros(SIZE)
            for c in category:
                if c in w2v.vocab:
                    vector = vector + w2v[c]
                    count = count + 1
            if count != 0:
                matrix_company_categories_embedded[index, :] = vector / count
            else:
                matrix_company_categories_embedded[index, :] = vector
print(matrix_company_categories_embedded.shape)
K-means is used to cluster the categories by passing the precomputed cosine similarity matrix of the encoded categories to the model.
%time similarities_distances = cosine_similarity(matrix_company_categories_embedded)  # matrix of pairwise cosine similarities
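As a side note (my sketch, not the notebook's approach), k-means can also be run on the L2-normalized embeddings directly: for unit vectors, squared Euclidean distance is a monotone function of cosine similarity, so this approximates the same grouping without materializing the full pairwise similarity matrix.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Toy stand-in for matrix_company_categories_embedded (hypothetical data).
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 300))

unit = normalize(emb)  # scale each row to unit L2 norm
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(unit)
print(labels.shape)  # (100,)
```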
Sum_of_squared_distances = [KMeans(n_clusters=k).fit(similarities_distances).inertia_ for k in range(1,25)]
plt.figure(figsize=(16,8))
plt.plot(range(1,25), Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
comments : The elbow method does not show an obvious optimal k. This is often the case when clustering natural-language data. The best k and best clusters are found by trying several runs and checking the composition of each cluster. In this case, k=10 turns out to give good results.
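When the elbow is ambiguous, the silhouette score offers a complementary check on cluster quality (a sketch I am adding; the random data below stands in for the actual similarity matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for similarities_distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

for k in (2, 5, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # in [-1, 1]; higher means tighter, better-separated clusters
    print(k, round(score, 3))
```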
# Fitting K-Means to the dataset
cluster_number = 10
kmeans_ = KMeans(n_clusters = cluster_number, init = 'k-means++', random_state = 0)
%time kmeans_.fit(similarities_distances)
clusters_ = kmeans_.labels_
#centers_ = kmeans_.cluster_centers_
print(Counter(clusters_))
clusters_composition = {}
for k in range(0, cluster_number):
    clusters_composition[k] = unique_categories[clusters_ == k]
    words = []
    for i in clusters_composition[k]:
        for vv in i.split(' '):
            if vv != '':
                words.append(vv)
    print('Most common words in cluster ' + str(k) + ':')
    print(Counter(words).most_common(20))
    print('--------------------')
comments : Some clusters are well defined and give a clear human-level understanding of a business type, whereas clusters 2, 5 and 9 are a bit less obvious. What we need to keep in mind is that clients won't have the same behaviour and type of complaint depending on the product or service the business sells.
# assign each business to the cluster most of its categories belong to
result_clusters = []
for b in business.categories.values:
    all_cat_company = b.split(',')
    cluster_belong = -1 * np.ones(len(all_cat_company))
    for index, single_cat in enumerate(all_cat_company):
        single_cat = single_cat.replace('&', '').replace('/', ' ').strip().lower()
        for i in range(0, cluster_number):
            if single_cat in clusters_composition[i]:
                cluster_belong[index] = i
                break
    result_clusters.append(stats.mode(cluster_belong, axis=None)[0][0])
results_clusters_theme = {'food_and_drinks' : Counter(result_clusters)[1],
'2' : Counter(result_clusters)[2],
'social_activities' : Counter(result_clusters)[3],
'healthcare' : Counter(result_clusters)[4],
'5' : Counter(result_clusters)[5],
'nationnalities' : Counter(result_clusters)[6],
'stores_and_shops' : Counter(result_clusters)[7],
'life_places' : Counter(result_clusters)[8],
'9' : Counter(result_clusters)[9],
'transportation_services' : Counter(result_clusters)[0],}
pprint(results_clusters_theme)
# add cluster feature to the dataframe and drop the former categories feature
cluster_names = {0: 'transportation_services', 1: 'food_and_drinks', 2: '2',
                 3: 'social_activities', 4: 'healthcare', 5: '5',
                 6: 'nationnalities', 7: 'stores_and_shops', 8: 'life_places'}
# any label not listed above (cluster 9 and the unassigned -1) falls back to '9', as before
clusters_named = [cluster_names.get(i, '9') for i in result_clusters]
business['cluster'] = clusters_named
business = business.drop(['categories'], axis=1)
business.head(4)
plt.figure(figsize=(20,6))
plt.title('Cluster sizes')
sns.countplot(x='cluster', data=business, order = business['cluster'].value_counts().index, palette='RdYlBu')
plt.show()
comments: Using clustering on this feature was really useful: the 1,300 unique categories were reduced to 10 clusters. From the bar plot above, we clearly see that Yelp is mostly used for restaurants and bars.
Since we know there are different types of businesses, we need to learn more about Yelp behaviour across those business types and across spatial locations.
y2 = business.groupby('state').size().values
y3 = business.groupby(['state'])['review_count'].sum().values
index = business.groupby('state').size().keys()
y4 = (y3/y3.sum())*100
y5 = (y2/y2.sum())*100
y6 = business.groupby(['state'])['stars'].mean()
y7 = business.groupby(['state'])['stars'].std()
y8 = business.groupby(['state'])['ratio_stars_review'].std()
y9 = business.groupby(['state'])['ratio_stars_review'].mean()
business_statistics = pd.DataFrame({'nb_of_business': y2,
'nb_of_business_in_perc': y5,
'nb_of_reviews': y3,
'nb_of_reviews_in_perc': y4,
'stars_mean' : y6,
'ratio_stars_review_mean' : y9,
'stars_std' : y7,
'ratio_stars_review_std' : y8,
'state': index}, index=index)
business_statistics
comments : Data are not evenly distributed across states, especially regarding the number of businesses or the number of reviews per state. It ranges from very low to very high: from 1 registered business in Tennessee, for example, to more than 50,000 in Arizona.
# Thus we drop the states where nb_of_reviews is too low, because we need comparable distributions later on
states_to_drop = business_statistics[business_statistics.nb_of_reviews <= 20].index.values
print('states to drop because their number of reviews is too low')
print(states_to_drop)
business = business[~business['state'].isin(states_to_drop) ]
business_statistics = business_statistics[~business_statistics['state'].isin(states_to_drop)]
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(18, 6))
sns.set_color_codes("muted")
sns.barplot(x="nb_of_business_in_perc", y="state", data=business_statistics,
label="businesses count", color="darkmagenta")
sns.set_color_codes("muted")
sns.barplot(x="nb_of_reviews_in_perc", y="state", data=business_statistics,
label="reviews count", color="b")
ax.legend(ncol=2, loc="upper right", frameon=True)
ax.set(xlim=(0, 35), ylabel="states", xlabel="counts in %", title="Overall statistics of the dataset by state in %")
sns.despine(left=True, bottom=True)
comments :
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
sns.scatterplot(x='stars', y='review_count', data=business, alpha=0.4)
plt.title('distribution between stars and number of reviews')
plt.ylabel('number of reviews')
plt.xlabel('stars')
plt.subplot(1,2,2)
sns.distplot(business['ratio_stars_review'], bins=20)
plt.title('histogram of the ratio stars per review count')
plt.show()
comments: The overall star is probably already the average of all the stars given by each client; hence it is natural that businesses with the highest number of reviews tend toward the average star rating. Given the scale of the number of reviews versus the stars, the distribution of the ratio is highly right-skewed, with a mean close to 0, since 1/n tends to 0 as n tends to infinity. However, we can notice some outliers in both figures and the distribution is clearly fat-tailed. Those outliers would mean either that the number of reviews is abnormally high with a consensus among users (left figure), or that the number of reviews is really low (right figure).
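The right-skew intuition (most ratios shrink toward 0 while a few low review counts produce large ratios) can be sanity-checked with a quick simulation on synthetic data (the review-count distribution below is an assumption for illustration):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
stars = rng.integers(1, 6, size=10_000)  # star ratings in 1..5
# heavy-tailed review counts (assumed lognormal), floored at 1
n_reviews = np.maximum(1, rng.lognormal(3, 1, 10_000).astype(int))

ratio = stars / n_reviews
print(skew(ratio) > 0)  # right-skewed, as in the histogram above
```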
business_theme = business.copy()
business_theme = business_theme[business_theme.cluster != '5']
business_theme = business_theme[business_theme.cluster != '2']
business_theme = business_theme[business_theme.cluster != '9']
business_theme = business_theme[business_theme.cluster != 'nationnalities']
The ratio regarding the states
plt.figure(figsize=(18,5))
sns.catplot(x="state", y="ratio_stars_review", kind="box", data=business_theme, height=5, aspect=3)
plt.show()
comments: There are no significant differences between states across the whole dataset, considering that states do not have the same number of registered businesses.
The ratio regarding the types
plt.figure(figsize=(18,6))
sns.violinplot(x="cluster", y="ratio_stars_review", data=business_theme, palette='RdYlBu')
plt.title('Ratio stars-review per business type')
plt.show()
comments : The ratio of score per number of reviews depends much more on the type of business than on the state. The food and drinks type has a significantly different shape compared to the other types. It also has the lowest mean and many outliers.
The ratio regarding the states in the food and drinks type
sns.catplot(x="state", y="ratio_stars_review", kind="box", data=business_theme[(business_theme.cluster == 'food_and_drinks')], height=5, aspect=3)
plt.show()
comments: This is an interesting output. Filtered to the 'food and drinks' type of business, there is a significant difference between states, which was not visible in the whole dataset. Arizona and Nevada have the lowest mean and the highest number of outliers.
These findings are however tricky, since the shape of the stars/reviews ratio mostly depends on the number of reviews. Hence, in the following part, we compare these first insights to the ratio of closed restaurants.
In this section we want to see if there is a preliminary link between the number of closed restaurants/bars and Yelp activity, as measured by score and number of reviews. To do so, we check the centrality and dispersion versus this ratio. Since the box plots above showed many outliers, metrics robust to outliers are computed: the median instead of the mean, and the interquartile range (IQR) instead of the standard deviation.
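As a quick illustration of why the median and IQR are preferred here (toy numbers of my own, not from the dataset):

```python
import numpy as np

x = np.array([3.0, 3.5, 4.0, 3.5, 4.0])
x_out = np.append(x, 500.0)  # one extreme outlier

print(np.mean(x), np.mean(x_out))      # the mean jumps badly
print(np.median(x), np.median(x_out))  # the median barely moves
iqr = np.quantile(x_out, 0.75) - np.quantile(x_out, 0.25)
print(iqr)                             # the IQR also stays small
```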
food_and_drinks = business[business.cluster=='food_and_drinks']
counts_ = food_and_drinks.groupby(['state', 'is_open']).size()
states_to_plot = list(np.unique(food_and_drinks.state.values))
states_to_plot.remove('florida')
states_to_plot.remove('texas') #
states_to_plot.remove('new_york') #only 1
ratio_close = []
median_ratio_stars_review = []
iqr_ratio_stars_review = []
for state in states_to_plot:
    ratio_close.append(100 * counts_[state][0] / (counts_[state][0] + counts_[state][1]))
    median_ratio_stars_review.append(food_and_drinks[food_and_drinks.state == state]['ratio_stars_review'].median())
    iqr = food_and_drinks[food_and_drinks.state == state]['ratio_stars_review'].quantile(0.75) \
          - food_and_drinks[food_and_drinks.state == state]['ratio_stars_review'].quantile(0.25)
    iqr_ratio_stars_review.append(iqr)
fig = plt.figure(figsize=(18,9))
ax = fig.add_subplot(111, projection='3d')
for i, txt in enumerate(states_to_plot):  # plot each point with its state name as text above
    ax.scatter(ratio_close[i], median_ratio_stars_review[i], iqr_ratio_stars_review[i], color='b', s=40)
    ax.text(ratio_close[i], median_ratio_stars_review[i], iqr_ratio_stars_review[i] + 0.02, '%s' % txt, size=12, zorder=1, color='k')
ax.set_xlabel('ratio of closed restaurants in %')
ax.set_ylabel('centrality (median)')
ax.set_zlabel('dispersion (iqr)')
ax.set_title('Relation between the ratio of closed restaurants and the ratio stars out of reviews number')
plt.show()
comments : The states with the highest ratio of closed restaurants in the Yelp dataset (around 28%) are those with the lowest median of the stars/reviews ratio and among the lowest IQRs. Thus, this figure highlights a possible link between these two variables. Further research is done on the Arizona case.
arizona_food_and_drinks = business[business.state=='arizona']
arizona_food_and_drinks = arizona_food_and_drinks[arizona_food_and_drinks.cluster=='food_and_drinks']
arizona_food_and_drinks.head(3)
# we don't need the information about state and cluster in the first dataframe
arizona_food_and_drinks = arizona_food_and_drinks.drop(['state', 'cluster'], axis=1)
arizona_food_and_drinks.isnull().sum()
review_arizona.head(3)
#check if there is any missing values
review_arizona.isnull().sum()
#check if there is any review duplicates
review_arizona[review_arizona.duplicated(['review_id'])]
arizona_case = arizona_food_and_drinks.merge(review_arizona, left_on='business_id', right_on='business_id').set_index(['business_id'], drop=False).sort_index()
arizona_case['overall_star'] = arizona_case.stars_x
arizona_case['single_client_star'] = arizona_case.stars_y
arizona_case = arizona_case.drop(['stars_x', 'stars_y'], axis=1)
arizona_case.head(2)
print('unique business id: ', arizona_case.index.unique().shape[0])
print('number total of reviews: ', arizona_case.review_id.count())
arizona_case.isnull().sum()
comment : From now on, the single star ratings, review timestamps and review texts are available. The features useful, funny and cool are the numbers of votes a review received from other users grading it as useful, funny or cool.
# we save the business information in a separate dataframe for easier use
arizona_case_business_info = arizona_case.drop_duplicates(subset=['business_id'], inplace=False).set_index(['business_id']).sort_index()
arizona_case_business_info = arizona_case_business_info.drop(['cool', 'date', 'funny', 'review_id', 'text', 'useful', 'user_id', 'single_client_star'], axis=1)
arizona_case_business_info.head(3)
comments: We now have all the information needed, sorted by business_id.
In this part, we want to check how the overall grade of a business is computed and how this computation relates to other variables.
adding info about the star rating behaviour of each business
#true mean from each star given by client per business id
arizona_case_business_info['true_mean_stars'] = arizona_case.groupby(arizona_case.index)['single_client_star'].mean()
#true std from each star given by client per business id
arizona_case_business_info['true_std_stars'] = arizona_case.groupby(arizona_case.index)['single_client_star'].std()
#true median from each star given by client per business id
arizona_case_business_info['true_median_stars'] = arizona_case.groupby(arizona_case.index)['single_client_star'].median()
#true inter quantile range
arizona_case_business_info['true_iqr_stars'] = arizona_case.groupby(arizona_case.index)['single_client_star'].quantile(0.75) - arizona_case.groupby(arizona_case.index)['single_client_star'].quantile(0.25)
computing the difference between the true mean and the final Yelp output
yelp_output = arizona_case_business_info['overall_star']
mean_stars = arizona_case_business_info['true_mean_stars']
acceptable_output = np.unique(yelp_output)  # Yelp ranges its overall final star between 1 and 5 with a step of 0.5
rounded_mean_stars = []
for number in mean_stars:
    rounded_number = acceptable_output.flat[np.abs(acceptable_output - number).argmin()]
    rounded_mean_stars.append(rounded_number)
arizona_case_business_info['rounded_mean_stars'] = rounded_mean_stars
arizona_case_business_info['difference_of_output_star'] = arizona_case_business_info['overall_star'] - arizona_case_business_info['rounded_mean_stars']
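As an aside, the nearest-value search used above amounts to rounding to the nearest half star, which can be vectorized; note that np.round rounds halves to even, so exact .25/.75 ties may resolve differently than the argmin search does (my sketch, not the notebook's code):

```python
import numpy as np

def half_star(x):
    # round to the nearest 0.5 step, then clip to Yelp's 1..5 range
    return np.clip(np.round(np.asarray(x) * 2) / 2, 1.0, 5.0)

print(half_star([3.6, 3.8, 0.4]))  # [3.5 4.  1. ]
```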
results
arizona_case_business_info.head()
ratio_outliers_up_down = {
'+0.5' : 100* Counter(yelp_output - rounded_mean_stars)[0.5]/len(yelp_output),
'-0.5' : 100* Counter(yelp_output - rounded_mean_stars)[-0.5]/len(yelp_output),
'-1.0' : 100* Counter(yelp_output - rounded_mean_stars)[-1.0]/len(yelp_output),
'+1.0' : 100* Counter(yelp_output - rounded_mean_stars)[1.0]/len(yelp_output),
'+1.5' : 100* Counter(yelp_output - rounded_mean_stars)[1.5]/len(yelp_output),
}
print('number of up and down rating vs the true rounded average')
print(Counter(yelp_output - rounded_mean_stars))
print('---------------------------------------------------------------')
print('ratio_outliers_up_down')
print(ratio_outliers_up_down)
print('---------------------------------------------------------------')
print('count_outliers_up_down_ for opened')
print(Counter(arizona_case_business_info[arizona_case_business_info.is_open==1]['difference_of_output_star']))
print()
print('count_outliers_up_down_ for close')
print(Counter(arizona_case_business_info[arizona_case_business_info.is_open==0]['difference_of_output_star']))
print('---------------------------------------------------------------')
print('The restaurants that got -1.0 point in their overall star score are:')
print(arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-1.0][['name', 'overall_star', 'rounded_mean_stars']])
print('---------------------------------------------------------------')
print('The restaurants that got +1.0 point in their overall star score are:')
print(arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.0][['name', 'overall_star', 'rounded_mean_stars']])
print('---------------------------------------------------------------')
print('The restaurants that got +1.5 point in their overall star score are:')
print(arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.5][['name', 'overall_star', 'rounded_mean_stars']])
plt.figure(figsize=(23, 6))
plt.subplot(1,3,1)
sns.boxplot(x=yelp_output, y=mean_stars - yelp_output, notch=True,
boxprops={'alpha':.8, 'edgecolor':'white', 'facecolor':'navy'},
whiskerprops={'color':'navy'},
medianprops={'color':'white'},
flierprops={'marker':'o', 'markerfacecolor':'white', 'markeredgecolor':'dodgerblue'},
capprops={'color':'navy'})
plt.xlabel('Yelp overall grade')
plt.ylabel('Errors to the true average')
plt.title('A true average computation')
plt.subplot(1,3,2)
colors = np.where(yelp_output==rounded_mean_stars, 'navy', 'royalblue')
colors[1225] = 'r'
colors[2275] = 'r'
colors[6865] = 'g'
colors[8901] = 'g'
colors[7961] = 'gold'
plt.scatter(x=yelp_output, y=rounded_mean_stars, s=20, c = colors, marker='x')
plt.annotate('Long Wong s North Phoenix', (2.0 - 1.2, 3.0 + .07)) #-1
plt.annotate('Cathy s Rum Cake Catering', (2.5 - 1.66, 3.5 + .07)) #-1
plt.annotate('Subway', (3.5 - .1 , 2.5 - .17)) #+1
plt.annotate('Wingstop', (5.0 - .33, 4.0 - .17)) #+1
plt.annotate('Green River Foods', (4.0 +.03, 2.5 +.03)) #+1.5
plt.xlabel('Yelp overall grade')
plt.ylabel('Errors to the true rounded average')
plt.title('A true rounded average computation')
plt.subplot(1,3,3)
sns.countplot(x='difference_of_output_star', data=arizona_case_business_info, hue='is_open', palette='BuPu')
plt.xlabel('Difference between Yelp overall grade and rounded mean computed')
plt.title('Number of up-rating vs down rating')
plt.suptitle('Difference between Yelp final star computation and the true average of all stars')
plt.show()
comments : Most final Yelp stars per business are computed as a true average. 3.7% of the dataset got +0.5 points in their final computation and 2% got -0.5 points. For the abnormally lucky or unlucky ones, the error went up to +1.5 points and down to -1. Regarding a potential link to closing, there is no obvious connection so far, but we investigate a bit more in the following section.
Is there an impact of this difference?
# set up a figure twice as wide as it is tall
fig = plt.figure(figsize=(25,9))
ax = fig.add_subplot(1, 2, 1, projection='3d')
still_open = arizona_case_business_info.is_open==1
ax.scatter(arizona_case_business_info[still_open].difference_of_output_star, arizona_case_business_info[still_open].true_mean_stars, arizona_case_business_info[still_open].true_std_stars, color='b', s=40, alpha=0.8)
ax.scatter(arizona_case_business_info[~still_open].difference_of_output_star, arizona_case_business_info[~still_open].true_mean_stars, arizona_case_business_info[~still_open].true_std_stars, color='r', s=40, alpha=0.2)
ax.set_zlabel('dispersion - std')
ax.set_ylabel('centrality - mean')
ax.set_xlabel('difference of output star')
ax.legend(['still open', 'closed'])  # a list keeps the label order matching the scatter order
ax.view_init(40, 170)
ax = fig.add_subplot(1, 2, 2, projection='3d')
still_open = arizona_case_business_info.is_open==1
ax.scatter(arizona_case_business_info[still_open].overall_star, arizona_case_business_info[still_open].true_median_stars, arizona_case_business_info[still_open].true_iqr_stars, color='b', s=40, alpha=0.8)
ax.scatter(arizona_case_business_info[~still_open].overall_star, arizona_case_business_info[~still_open].true_median_stars, arizona_case_business_info[~still_open].true_iqr_stars, color='r', s=40, alpha=0.2)
ax.set_zlabel('dispersion - iqr')
ax.set_ylabel('centrality - median')
ax.set_xlabel('Yelp overall star')
ax.view_init(40, 170)
plt.suptitle('Relation between given stars on the open/close')
plt.show()
comment :
In this part, we mainly investigate the impact of reviews rated by other users as funny, useful or cool.
overview
arizona_case[['useful', 'cool', 'funny']].describe()
plt.figure(figsize=(20, 6), linewidth=5)
plt.subplot(1,3,1)
sns.scatterplot(x ='useful', y='single_client_star', data=arizona_case, hue="single_client_star", legend='brief')
plt.xlabel('Rated "useful" by other users')
plt.ylabel('star score')
plt.subplot(1,3,2)
sns.scatterplot(x ='cool', y='single_client_star', data=arizona_case)
plt.xlabel('Rated "cool" by other users')
plt.ylabel('star score')
plt.subplot(1,3,3)
sns.scatterplot(x = 'funny', y='single_client_star', data=arizona_case, hue="single_client_star", legend='brief')
plt.xlabel('Rated "funny" by other users')
plt.ylabel('star score')
plt.suptitle('Distribution of reviews indicators per star scores')
plt.show()
comments : Interestingly, the funny indicator behaves in the opposite way to the useful one: the reviews rated most useful by other users go with low star scores, while the reviews rated funniest go with high star scores. The cool score can go with anything, so it does not add information.
adding info to the database
bins = [0,50,100,200,400,600]
arizona_case['useful_binned'] = np.digitize(arizona_case.useful, bins)
arizona_case['cool_binned'] = np.digitize(arizona_case.cool, bins)
arizona_case['funny_binned'] = np.digitize(arizona_case.funny, bins)
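`np.digitize` maps each value to the index of the bin it falls into (the edges being the `bins` list above, with values past the last edge landing in the final open-ended bin); a quick check of what the binned features contain:

```python
import numpy as np

bins = [0, 50, 100, 200, 400, 600]
counts = np.array([10, 60, 250, 700])
print(np.digitize(counts, bins))  # [1 2 4 6]
```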
Counter(arizona_case['useful_binned'])
test = pd.DataFrame(arizona_case.groupby([arizona_case.index, 'funny']).size())
result_mutl = pd.DataFrame({'prod_funny': np.multiply(test.index.get_level_values('funny').tolist(), test[0].values),
'business_id':test.index.get_level_values('business_id').tolist()} , index=test.index.get_level_values('business_id').tolist())
result_reviews_ind = pd.DataFrame(result_mutl.groupby('business_id').sum().values.flatten(), index=result_mutl.groupby('business_id').sum().index, columns=['weighted_funny'])
result_reviews_ind['ratio_w_funny_reviews'] = result_reviews_ind['weighted_funny'] / arizona_case_business_info['review_count']
test = pd.DataFrame(arizona_case.groupby([arizona_case.index, 'cool']).size())
result_mutl = pd.DataFrame({'prod_cool': np.multiply(test.index.get_level_values('cool').tolist(), test[0].values),
'business_id':test.index.get_level_values('business_id').tolist()} , index=test.index.get_level_values('business_id').tolist())
result_reviews_ind['weighted_cool'] = result_mutl.groupby('business_id').sum().values.flatten()
result_reviews_ind['ratio_w_cool_reviews'] = result_reviews_ind['weighted_cool'] / arizona_case_business_info['review_count']
test = pd.DataFrame(arizona_case.groupby([arizona_case.index, 'useful']).size())
result_mutl = pd.DataFrame({'prod_useful': np.multiply(test.index.get_level_values('useful').tolist(), test[0].values),
'business_id':test.index.get_level_values('business_id').tolist()} , index=test.index.get_level_values('business_id').tolist())
result_reviews_ind['weighted_useful'] = result_mutl.groupby('business_id').sum().values.flatten()
result_reviews_ind['ratio_w_useful_reviews'] = result_reviews_ind['weighted_useful'] / arizona_case_business_info['review_count']
arizona_case_business_info['ratio_w_funny_reviews'] = result_reviews_ind['ratio_w_funny_reviews']
arizona_case_business_info['ratio_w_cool_reviews'] = result_reviews_ind['ratio_w_cool_reviews']
arizona_case_business_info['ratio_w_useful_reviews'] = result_reviews_ind['ratio_w_useful_reviews']
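The size-and-multiply route above appears to reduce to a plain groupby sum; a toy check of that equivalence (my simplification, with a hypothetical two-business frame):

```python
import pandas as pd

# Hypothetical reviews: two businesses, a 'funny' vote count per review.
df = pd.DataFrame({'funny': [2, 0, 3, 1]},
                  index=['b1', 'b1', 'b2', 'b2'])

# verbose route: count (business, funny) pairs, multiply back, sum per business
sizes = df.groupby([df.index, 'funny']).size()
verbose = (sizes * sizes.index.get_level_values('funny')).groupby(level=0).sum()

# direct route: just sum the votes per business
direct = df.groupby(df.index)['funny'].sum()

print((verbose.values == direct.values).all())  # True
```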
results
arizona_case_business_info.head()
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(projection='3d')
ax.set(xlim=(0, 30), ylim=(0, 10))
ax.scatter(arizona_case_business_info[still_open].ratio_w_useful_reviews, arizona_case_business_info[still_open].ratio_w_funny_reviews, arizona_case_business_info[still_open].ratio_stars_review, color='b', s=40, alpha=0.8)
ax.scatter(arizona_case_business_info[~still_open].ratio_w_useful_reviews, arizona_case_business_info[~still_open].ratio_w_funny_reviews, arizona_case_business_info[~still_open].ratio_stars_review, color='r', s=40, alpha=0.2)
ax.set_xlabel('useful')
ax.set_ylabel('funny')
ax.set_zlabel('star')
ax.view_init(40, 50)
plt.suptitle('Relation between given stars and given reviews on the open/close')
plt.show()
comments : The high ratios of useful and funny reviews, which seem to go with a specific star level, do not show any impact on the closed/open information.
This section focuses on the spatial information.
The tourist attractions were gathered from planetware.com/tourist-attractions-/phoenix.
phoenix_city_center = [33.448400, -112.074000]  # explains attractiveness but also rent (the closer to the city center, the higher)
tourist_attractions = {
'heard_museum' : [33.472500, -112.072200],
'musical_museum' : [33.667621, -111.978492],
'taliesin_west' : [33.606300, -111.845000],
'old_town_scottsdale' : [33.498400, -111.926100],
'hall_of_flame' : [33.447600, -111.953200],
'phoenix_zoo' : [33.451200, -111.948000],
'capitol' : [33.448100, -112.097100],
'camelback_mountain' : [33.515100, -111.961900],
'south_mountain_park_and_preserve' : [33.346700, -112.084700],
}
* Heat map based on review activity and tourist attractions
comments : Yelp review activity covers Phoenix and its surroundings. This map shows the popularity of restaurants/bars based on the number of Yelp reviews. The most-reviewed restaurants seem to be the closest to the city center and the tourist attractions.
* Map of the up- and down-rated outliers
hmap = folium.Map(location=[33.495194,-112.074000], tiles='Stamen Toner', zoom_start=10)
#hm_wide = HeatMap(list(zip(star_frequency.latitude.values, star_frequency.longitude.values)),
# min_opacity=0.2,
# radius=5, blur=10,
# max_zoom=0.001)
#tourist attractions
for place in tourist_attractions:
    lat = tourist_attractions[place][0]
    long = tourist_attractions[place][1]
    folium.Marker([lat, long], popup=str(place)).add_to(hmap)
#+1.5
folium.Marker([arizona_case_business_info[arizona_case_business_info.difference_of_output_star>1.0].latitude.values[0], arizona_case_business_info[arizona_case_business_info.difference_of_output_star>1.0].longitude.values[0]],
popup = arizona_case_business_info[arizona_case_business_info.difference_of_output_star>1.0]['name'].iloc[0],
icon=folium.Icon(color='orange')).add_to(hmap)
#+1.0
folium.Marker([arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.0]['latitude'].iloc[0], arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.0]['longitude'].iloc[0]],
popup = arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.0]['name'].iloc[0],
icon=folium.Icon(color='green')).add_to(hmap)
folium.Marker([arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.0]['latitude'].iloc[1], arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.0]['longitude'].iloc[1]],
popup = arizona_case_business_info[arizona_case_business_info.difference_of_output_star==1.0]['name'].iloc[1],
icon=folium.Icon(color='green')).add_to(hmap)
#-1.0
folium.Marker([arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-1.0]['latitude'].iloc[0], arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-1.0]['longitude'].iloc[0]],
popup = arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-1.0]['name'].iloc[0],
icon=folium.Icon(color='red')).add_to(hmap)
folium.Marker([arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-1.0]['latitude'].iloc[1], arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-1.0]['longitude'].iloc[1]],
popup = arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-1.0]['name'].iloc[1],
icon=folium.Icon(color='red')).add_to(hmap)
#+0.5
for loc in arizona_case_business_info[arizona_case_business_info.difference_of_output_star==0.5][['latitude', 'longitude']].values:
    folium.Circle(
        radius=85,
        location=list(loc),
        color='darkgreen',
        fill=True,
    ).add_to(hmap)
#-0.5
for loc in arizona_case_business_info[arizona_case_business_info.difference_of_output_star==-0.5][['latitude', 'longitude']].values:
    folium.Circle(
        radius=90,
        location=list(loc),
        color='darkred',
        fill=True,
    ).add_to(hmap)
#hmap.add_child(hm_wide)
hmap
comments : there is no obvious link between those outliers and the distances. Star grades depend less on location than on review activity.
We don't have information about who uses Yelp. According to US Demographics of Yelp Users (), age is equally distributed, most users went to college and 64% have an income of $100K+. Unfortunately, this does not tell us whether Yelp users are mostly tourists travelling to another city or long-term residents. However, the maps above suggest that the farther we get from the city, the lower the review activity. Hence, we can add this information to the database by turning latitude and longitude into distances. If we consider that users are either tourists or residents, a restaurant's activity will mostly depend on its geographic location and its proximity to tourist attractions, city centers or workplaces.
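Turning coordinates into distances relies on the haversine (great-circle) formula. A minimal pure-Python sketch, reusing the Phoenix city center and Heard Museum coordinates from the attraction list (note that sklearn's `haversine_distances` likewise works in radians and returns distances that must be scaled by the Earth radius):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p1, p2):
    """Great-circle distance in km between two [lat, lon] points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, [p1[0], p1[1], p2[0], p2[1]])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

center = [33.448400, -112.074000]   # Phoenix city center
museum = [33.472500, -112.072200]   # Heard Museum
d = haversine_km(center, museum)
print(d)  # roughly 2.7 km
```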
computation
# sklearn's haversine_distances expects coordinates in radians and returns
# distances in radians: multiply by the Earth radius (~6371 km) to get km
EARTH_RADIUS_KM = 6371
locations = np.radians(arizona_case_business_info[['latitude', 'longitude']].values)
center_rad = np.radians(phoenix_city_center)
dist_to_phoenix_city_center = []
for loc in locations:
    dist_to_phoenix_city_center.append(haversine_distances([loc, center_rad])[0,1] * EARTH_RADIUS_KM)
arizona_case_business_info['dist_to_phoenix_city_center'] = dist_to_phoenix_city_center
attractions_rad = {name: np.radians(coords) for name, coords in tourist_attractions.items()}
min_dist_to_attraction = []
for loc in locations:
    dist_to_attraction = np.zeros(len(attractions_rad))
    for i, name in enumerate(attractions_rad):
        dist_to_attraction[i] = haversine_distances([loc, attractions_rad[name]])[0,1] * EARTH_RADIUS_KM
    min_dist_to_attraction.append(min(dist_to_attraction))
arizona_case_business_info['min_dist_tourist_attraction'] = min_dist_to_attraction
arizona_case_business_info = arizona_case_business_info.drop(['latitude', 'longitude'], axis=1) #no longer needed
results
arizona_case_business_info.head()
plt.figure(figsize=(18, 10), linewidth=1)
still_open = arizona_case_business_info.is_open==1
plt.subplot(2,2,1)
sns.kdeplot(arizona_case_business_info[still_open]["dist_to_phoenix_city_center"], shade=True, color="b", label='open restaurants/bars')
sns.kdeplot(arizona_case_business_info[~still_open]["dist_to_phoenix_city_center"], shade=True, color="r", label='closed restaurants/bars')
plt.title('Distances to Phoenix city center')
plt.subplot(2,2,2)
sns.kdeplot(arizona_case_business_info[still_open]["min_dist_tourist_attraction"], shade=True, color="b", label='open restaurants/bars')
sns.kdeplot(arizona_case_business_info[~still_open]["min_dist_tourist_attraction"], shade=True, color="r", label='closed restaurants/bars')
plt.title('Closest distances to a tourist attraction')
plt.subplot(2,2,3)
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star == 0]["dist_to_phoenix_city_center"], shade=True, color="g", label='no outliers')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star > 0]["dist_to_phoenix_city_center"], shade=True, color="pink", label='outliers_up')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star < 0]["dist_to_phoenix_city_center"], shade=True, color="gold", label='outliers_down')
plt.title('Distances to Phoenix city center')
plt.subplot(2,2,4)
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star == 0]["min_dist_tourist_attraction"], shade=True, color="g", label='no outliers')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star > 0]["min_dist_tourist_attraction"], shade=True, color="pink", label='outliers_up')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star < 0]["min_dist_tourist_attraction"], shade=True, color="gold", label='outliers_down')
plt.title('Closest distances to a tourist attraction')
plt.suptitle('Distribution of haversines distances')
plt.show()
comments:
arizona_case['date'] = pd.to_datetime(arizona_case['date'], format='%d/%m/%Y %H:%M:%S')
arizona_case.set_index(arizona_case['date'], inplace=True) #we change the index by the date
print('reviews in Arizona range from', arizona_case.date.min(), 'to', arizona_case.date.max())
arizona_case.head(2)
monthly_2018 = arizona_case[arizona_case.index.year==2018].resample('M').count()[['date']]
monthly_2018 = list(monthly_2018.date.values)
monthly_2018.append(0) #2018 data stops before December: pad the missing month with 0 so the year has 12 bars
monthly_2017 = arizona_case[arizona_case.index.year==2017].resample('M').count()[['date']]
monthly_2016 = arizona_case[arizona_case.index.year==2016].resample('M').count()[['date']]
monthly_2015 = arizona_case[arizona_case.index.year==2015].resample('M').count()[['date']]
monthly_2014 = arizona_case[arizona_case.index.year==2014].resample('M').count()[['date']]
monthly_2008 = arizona_case[arizona_case.index.year==2008].resample('M').count()[['date']]
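As a toy illustration of the monthly resampling used above (hypothetical dates, not the Yelp data):

```python
import pandas as pd

# three fictitious review timestamps, indexed by date as in arizona_case
idx = pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-03'])
toy = pd.DataFrame({'date': idx}, index=idx)

# resample('M') buckets rows by calendar month and count() tallies them
monthly = toy.resample('M').count()[['date']]
print(list(monthly.date.values))  # [2, 1] -> two January reviews, one February review
```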
plt.figure(figsize = (18, 6))
r1 = range(0, len(monthly_2017))
barWidth = 0.1
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
r4 = [x + barWidth for x in r3]
r5 = [x + barWidth for x in r4]
r6 = [x + barWidth for x in r5]
m8 = plt.bar(r1, monthly_2008.date.values, width=barWidth, edgecolor='white', color='pink', label='2008')
m14 = plt.bar(r2, monthly_2014.date.values, width=barWidth, edgecolor='white', color='lightcoral', label='2014')
m15 = plt.bar(r3, monthly_2015.date.values, width=barWidth, edgecolor='white', color='gold', label='2015')
m16 = plt.bar(r4, monthly_2016.date.values, width=barWidth, edgecolor='white', color='c', label='2016')
m17 = plt.bar(r5, monthly_2017.date.values, width=barWidth, edgecolor='white', color='lightblue', label='2017')
m18 = plt.bar(r6, monthly_2018, width=barWidth, edgecolor='white', color='navy', label='2018')
plt.xlabel('Date by month')
plt.ylabel('Number of reviews')
plt.title('Monthly evolution of the number of reviews in Arizona over the past years')
plt.xticks(np.arange(12)+0.3, ('Jan', 'Feb', 'March', 'April', 'May', 'June', 'July', 'August', 'Sept', 'Oct', 'Nov', 'Dec'))
plt.legend(handles=[m8, m14, m15, m16, m17, m18])
plt.show()
comments:
first overview of the most extreme outliers
we look at the outliers to see whether they reveal any time patterns
plt.figure(figsize=(20,6))
#plt.subplot(1,2,1)
sns.scatterplot(x='date', y='single_client_star', data=arizona_case[arizona_case.business_id=='pRitNmkahcveSE9XsoEdMw'], label='wingstop +1')
sns.scatterplot(x='date', y='single_client_star', data=arizona_case[arizona_case.business_id=='jndyQsXTF2tICjZ605X2Cw'], label='Green River Foods +1.5')
sns.scatterplot(x='date', y='single_client_star', data=arizona_case[arizona_case.business_id=='CFortoBLONVZ8YF-nY5H3A'], label="Cathy's Rum Cake Catering -1")
sns.scatterplot(x='date', y='single_client_star', data=arizona_case[arizona_case.business_id=='d4pXR8yVuJIm6Q3HuRYBfw'], label="Subway +1")
sns.scatterplot(x='date', y='single_client_star', data=arizona_case[arizona_case.business_id=='656XUeBmMzyvU62yYHptxQ'], label="Long Wong's North Phoenix -1")
#plt.xlim(arizona_case.date.sort_values()[180000], arizona_case.date.max())
#plt.subplot(1,2,2)
#sns.scatterplot(x='date', y='single_client_star', data=arizona_case[arizona_case['business_id'].isin(list(unique_business_info[unique_business_info.difference_of_output_star==0].index))], )
#sns.scatterplot(x='date', y='single_client_star', data=arizona_case[arizona_case['business_id'].isin(list(unique_business_info[unique_business_info.difference_of_output_star==0].index))], )
#ax.set_xlim(arizona_case.date.sort_values()[180000], arizona_case.date.max())
#ax.set_title('inliers')
plt.show()
comments : Looking at the outliers does not reveal any frequency pattern. Hence, in the following section we create time-series features and analyse the outputs:
adding time information
#number of days between the first review of each business and the earliest review in the dataset
arizona_case_business_info['born_in_yelp_d'] = (arizona_case.groupby(['business_id'])['date'].min() - arizona_case.index.min()).astype('timedelta64[D]')
#number of days between the first review they got and the last one
arizona_case_business_info['age_in_yelp_d'] = (arizona_case.groupby(['business_id'])['date'].max() - arizona_case.groupby(['business_id'])['date'].min()).astype('timedelta64[D]')
#number of min hours between 2 reviews
#number of max hours between 2 reviews
#average frequency in hours between 2 reviews
min_between_reviews_in_h = []
max_between_reviews_in_h = []
average_between_reviews_in_h = []
for b_id in arizona_case_business_info.index.values:
    diff_between_each_date = arizona_case[arizona_case.business_id==b_id].sort_index().date.diff()
    min_between_reviews_in_h.append(diff_between_each_date.min().total_seconds() // 3600)
    max_between_reviews_in_h.append(diff_between_each_date.max().total_seconds() // 3600)
    average_between_reviews_in_h.append(diff_between_each_date.mean().total_seconds() // 3600)
arizona_case_business_info['min_between_reviews_in_h'] = min_between_reviews_in_h
arizona_case_business_info['max_between_reviews_in_h'] = max_between_reviews_in_h
arizona_case_business_info['average_between_reviews_in_h'] = average_between_reviews_in_h
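A quick sanity check of the gap features on a hypothetical three-review series:

```python
import pandas as pd

# three fictitious review dates for a single business
dates = pd.Series(pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-05']))
gaps = dates.diff()  # NaT, 1 day, 3 days

# min/max/mean gaps converted from seconds to whole hours, as in the loop above
print(gaps.min().total_seconds() // 3600)   # 24.0 -> shortest gap in hours
print(gaps.max().total_seconds() // 3600)   # 72.0 -> longest gap in hours
print(gaps.mean().total_seconds() // 3600)  # 48.0 -> average gap in hours
```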
results:
arizona_case_business_info.head(2)
plt.figure(figsize=(18, 10), linewidth=1)
still_open = arizona_case_business_info.is_open==1
plt.subplot(2,2,1)
sns.kdeplot(arizona_case_business_info[still_open]["born_in_yelp_d"], shade=True, color="b", label='open restaurants/bars')
sns.kdeplot(arizona_case_business_info[~still_open]["born_in_yelp_d"], shade=True, color="r", label='closed restaurants/bars')
plt.title('Number of days between first review and Yelp start')
plt.subplot(2,2,2)
sns.kdeplot(arizona_case_business_info[still_open]["age_in_yelp_d"], shade=True, color="b", label='open restaurants/bars')
sns.kdeplot(arizona_case_business_info[~still_open]["age_in_yelp_d"], shade=True, color="r", label='closed restaurants/bars')
plt.title('Number of days between first and last review')
plt.subplot(2,2,3)
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star == 0]["born_in_yelp_d"], shade=True, color="g", label='no outliers')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star > 0]["born_in_yelp_d"], shade=True, color="pink", label='outliers_up')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star < 0]["born_in_yelp_d"], shade=True, color="gold", label='outliers_down')
plt.title('Number of days between first review and Yelp start')
plt.subplot(2,2,4)
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star == 0]["age_in_yelp_d"], shade=True, color="g", label='no outliers')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star > 0]["age_in_yelp_d"], shade=True, color="pink", label='outliers_up')
sns.kdeplot(arizona_case_business_info[arizona_case_business_info.difference_of_output_star < 0]["age_in_yelp_d"], shade=True, color="gold", label='outliers_down')
plt.title('Number of days between first and last review')
plt.suptitle('Distribution of ages')
plt.show()
comments : The age distributions have different shapes with respect to the open/closed target. While open restaurants have a fairly flat distribution, closed ones are roughly right-skewed. On average, closed restaurants were recorded on the Yelp platform around 3 years after its creation, around 2008, and lasted around 2 years. Regarding the outlier case, the distribution of first reviews shows that the under-rated businesses got, on average, their first review 4 years after Yelp's creation. However, there is no obvious pattern relating outliers to time, and the most likely conclusion is that those outliers are random.
bins = [0,24,72,168,336,730,1460,2229] #bin edges in hours: 1 day, 3 days, 7 days, 14 days, ~1 month, ~2 months, ~3 months
average_between_reviews_in_h_binned = np.digitize(arizona_case_business_info.average_between_reviews_in_h, bins)
arizona_case_business_info['average_between_reviews_in_h_binned'] = average_between_reviews_in_h_binned
print(Counter(average_between_reviews_in_h_binned))
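How np.digitize maps an average gap (in hours) to a bin index, using the same edges as above (the sample gap values are hypothetical):

```python
import numpy as np

bin_edges = [0, 24, 72, 168, 336, 730, 1460, 2229]
sample_gaps_h = [10, 100, 2000, 5000]  # hypothetical average gaps in hours

# each value gets the index of the interval it falls into; values >= 2229 get index 8
binned = np.digitize(sample_gaps_h, bin_edges)
print(binned)  # [1 3 7 8]: within 1 day, within 7 days, within ~3 months, more
```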
plt.figure(figsize=(18,6))
plt.subplot(1,3,1)
sns.scatterplot(x='min_between_reviews_in_h', y='ratio_stars_review', data=arizona_case_business_info, hue='is_open', palette='Set1')
plt.subplot(1,3,2)
sns.scatterplot(x='max_between_reviews_in_h', y='ratio_stars_review', data=arizona_case_business_info, hue='is_open', palette='Set1')
plt.subplot(1,3,3)
sns.scatterplot(x='average_between_reviews_in_h', y='ratio_stars_review', data=arizona_case_business_info, hue='is_open', palette='Set1')
plt.suptitle('Relation between review frequency and stars/reviews ratio')
plt.show()
plt.figure(figsize=(18,6))
sns.boxplot(x='average_between_reviews_in_h_binned', y='ratio_stars_review', data=arizona_case_business_info, hue='is_open', palette='Set1')
plt.xticks(range(0,len(bins)), ["within 1 D","3 D","7 D", "14 D", "1 M", "2 M", "3 M", "more"])
plt.title('average between reviews, a binning visualization')
plt.suptitle('Relation between review frequency and stars/reviews ratio')
comments: Binning the review frequencies reveals a meaningful pattern that the scatterplots do not. Indeed, the longer the time between 2 reviews, the higher the median and IQR of closed restaurants. The review frequency clearly confirms that restaurants can close because of a lack of attractiveness, shown here by a low review frequency on Yelp.
In this part, the reviews clients wrote in Arizona are investigated. We mostly want to know the rate of star misclassification with respect to review contents. If this rate turns out to be high, then in addition to the 6% of errors in the overall star computation, the overall star we look at is not that significant.
arizona_case_reviews = arizona_case[['review_id', 'text', 'useful_binned', 'funny_binned', 'single_client_star']].copy() #copy so the columns added later do not trigger a SettingWithCopyWarning
arizona_case_reviews.head(2)
# Coloring Function
def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)
# Word Cloud initialization
wc = WordCloud(background_color="black", max_words=1000, margin=0)
# Visualize Word Cloud
plt.figure(figsize=(18, 6), linewidth=1)
for i, grade in enumerate([1,2,3,4,5]):
    plt.subplot(2, 3, i+1)
    wc.generate(arizona_case_reviews.text[arizona_case_reviews.single_client_star==grade].str.cat(sep=' '))
    plt.imshow(wc.recolor(color_func=grey_color_func, random_state=3), interpolation='bilinear', aspect='auto')
    plt.axis("off")
    plt.title('Rated star ' + str(grade))
plt.suptitle('Word Cloud by grade')
plt.tight_layout(rect=[0, 0, 1, .95]);
comments : Word Cloud gives a first glance at the content of the reviews per star, and an idea of how successful the text classification can be. The results show that the most common words differ across categories and are quite relevant, especially in categories 5, 4 and 1. However, the classification might struggle more with classes 2 and 3. How can the sentiment classification be improved then? In the following part, we look at emojis and check whether they can improve the upcoming classification task.
#gathering emojis
emoticons_happy = set([
':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
'=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
'<3'
])
emoticons_sad = set([
':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
':c', ':{', '>:\\', ';('
])
#creating functions checking whether a specific review has a sad or happy sentiment
def is_emoji_happy(txt):
    for word in txt.split():
        if word in emoticons_happy:
            return True
    return False
def is_emoji_sad(txt):
    for word in txt.split():
        if word in emoticons_sad:
            return True
    return False
emoji_sentiment = []
for rev in arizona_case_reviews['text']:
    if is_emoji_happy(rev) and not is_emoji_sad(rev):
        emoji_sentiment.append('happy')
    elif not is_emoji_happy(rev) and is_emoji_sad(rev):
        emoji_sentiment.append('sad')
    else:
        emoji_sentiment.append('unclear')
arizona_case_reviews['emojis'] = arizona_case_reviews['text'].apply(lambda x: re.sub(r'[\w\s]','',x))
arizona_case_reviews['emoji_sentiment'] = emoji_sentiment
dummies_sentiment = pd.get_dummies(arizona_case_reviews['emoji_sentiment'])
arizona_case_reviews['happy'] = dummies_sentiment['happy']
arizona_case_reviews['sad'] = dummies_sentiment['sad']
results
arizona_case_reviews.head()
plt.figure(figsize=(18,6))
sns.countplot(x='emoji_sentiment', data=arizona_case_reviews[arizona_case_reviews.emoji_sentiment!='unclear'], hue='single_client_star', palette='RdYlBu')
plt.show()
comments : The count plot above shows that emoji recognition seems useful. A happy emoji sentiment clearly corresponds to the highest grades (4 and 5) and a sad one to the lowest grades. It might improve the classification accuracy.
arizona_case_reviews['polarity'] = arizona_case_reviews['text'].map(lambda text: TextBlob(text).sentiment.polarity)
results
arizona_case_reviews.head()
plt.figure(figsize=(18,6))
sns.boxplot(x='single_client_star', y='polarity', data=arizona_case_reviews, palette='RdYlBu')
plt.title('Sentiment polarity per given star')
plt.show()
comments: Even though we observe many outliers for each star, the average polarity increases with the star grade. This is a good sign for sentiment classification.
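TextBlob's polarity score is lexicon-based: each word known to its lexicon carries a score in [-1, 1] and the review's polarity averages them. A toy sketch of the idea (the two-word lexicon and its scores are invented for illustration; this is not TextBlob's actual lexicon or averaging logic):

```python
# toy polarity: average the per-word scores from a tiny hand-made lexicon
toy_lexicon = {'great': 0.5, 'terrible': -1.0}

def toy_polarity(text):
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

p = toy_polarity('great food but terrible service')
print(p)  # -0.25: the negative word outweighs the positive one
```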
In order to see if the graded star matches well with the description, we perform a classification task.
plt.figure(figsize=(10,6))
sns.countplot(x='single_client_star', data=arizona_case_reviews, palette='RdYlBu')
plt.show()
comment: the classes are a bit imbalanced (mainly classes 5 and 2) but, considering the initial number of observations per class, the train set will have enough data. Also, having balanced classes matters less in text classification than in other settings.
lemmatizer = WordNetLemmatizer()
def pre_process(text):
    # Normalize text to the same unicode form
    # (replaces the non-breaking space '\xc2\xa0' with a regular space)
    text = unicodedata.normalize("NFKD", text)
    # Lower case
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]','',text)
    # Remove stop words
    text = " ".join(x for x in text.split() if x not in stop_words)
    # Lemmatize words
    text = " ".join(lemmatizer.lemmatize(x) for x in text.split())
    return text
text_preprocessed = arizona_case_reviews['text'].apply(pre_process)
arizona_case_reviews['text_preprocessed'] = text_preprocessed
results
arizona_case_reviews.head()
In this part, the reviews must be turned into vectors. To do so, we use the tf-idf method. For each document, tf-idf computes the number of times a word w appears in the document divided by the total number of words in the document. Finally, each ratio is multiplied by the log of the ratio of the total number of documents (here reviews) to the number of reviews containing the word w.
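A minimal sketch of tf-idf on three hypothetical reviews. Note that scikit-learn's TfidfVectorizer uses a smoothed idf and l2-normalizes each row, so its numbers differ slightly from the textbook formula above, but the intuition is the same: words spread across many documents get lower weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['good food good service', 'bad food', 'good place']  # hypothetical reviews
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # ['bad', 'food', 'good', 'place', 'service']
print(X.shape)                  # (3, 5): one row per review, one column per term
# 'good' appears in 2 of the 3 docs, so its idf is lower than that of 'bad'
print(vec.idf_[vec.vocabulary_['good']] < vec.idf_[vec.vocabulary_['bad']])  # True
```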
#splitting dataset first into a training and testing set. We also make sure to use a stratified splitting.
X_train, X_test, y_train, y_test = train_test_split(arizona_case_reviews[['text_preprocessed', 'happy', 'sad', 'polarity']], np.array(arizona_case_reviews.single_client_star), test_size=0.3, random_state = 42, shuffle=True, stratify=np.array(arizona_case_reviews.single_client_star))
#encode text with tf-idf
tfidf_vectorizer = TfidfVectorizer()
tfidf_train = tfidf_vectorizer.fit_transform(X_train.text_preprocessed)
tfidf_test = tfidf_vectorizer.transform(X_test.text_preprocessed)
print(tfidf_train.shape)
print(tfidf_test.shape)
#here the tf-idf matrix of encoded reviews is enriched with the emoji sentiment analysis
# we add 2 binary column vectors to the matrix: isHappy, isSad (the unclear sentiment maps to 0 in both)
tfidf_emj_train = hstack((tfidf_train, np.array(X_train['happy'])[:,None]))
tfidf_emj_train = hstack((tfidf_emj_train, np.array(X_train['sad'])[:,None]))
tfidf_emj_test = hstack((tfidf_test, np.array(X_test['happy'])[:,None]))
tfidf_emj_test = hstack((tfidf_emj_test, np.array(X_test['sad'])[:,None]))
print(tfidf_emj_train.shape)
print(tfidf_emj_test.shape)
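The hstack trick above appends dense feature columns to a sparse tf-idf matrix without densifying it; a small self-contained sketch (the identity matrix stands in for a real tf-idf matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

tfidf_like = csr_matrix(np.eye(3))    # stand-in for a (3 reviews x 3 terms) tf-idf matrix
happy = np.array([1, 0, 1])[:, None]  # binary emoji feature, one row per review

augmented = hstack((tfidf_like, happy))  # result stays sparse
print(augmented.shape)  # (3, 4): one extra feature column
```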
We try Naive Bayes classifiers and Support Vector Machines, as they are widely used in NLP, especially for text classification and sentiment analysis. They are known to perform well on large feature spaces.
Multinomial Naive Bayes
#Mutlinomial naive bayes baseline
baseline_NB_tfidf = MultinomialNB().fit(tfidf_train, y_train)
NB_pred_train_tfidf = baseline_NB_tfidf.predict(tfidf_train)
NB_pred_test_tfidf = baseline_NB_tfidf.predict(tfidf_test)
#scores
train_baseline_accuracy_NB_tfidf = accuracy_score(y_train, NB_pred_train_tfidf)
test_baseline_accuracy_NB_tfidf = accuracy_score(y_test, NB_pred_test_tfidf)
print('Baseline Multinomial naive bayes')
print('with tfidf')
print('train', train_baseline_accuracy_NB_tfidf)
print('test', test_baseline_accuracy_NB_tfidf)
print('Baseline Naive Bayes - classification report on train')
print(classification_report(y_train, NB_pred_train_tfidf))
print()
print('Baseline Naive Bayes - classification report on test')
print(classification_report(y_test, NB_pred_test_tfidf))
Support vector machine with a linear kernel
#baseline Support vector machine
baseline_SVC_tfidf = LinearSVC().fit(tfidf_train, y_train)
SVC_pred_train_tfidf = baseline_SVC_tfidf.predict(tfidf_train)
SVC_pred_test_tfidf = baseline_SVC_tfidf.predict(tfidf_test)
#scores
train_baseline_accuracy_SVC_tfidf = accuracy_score(y_train, SVC_pred_train_tfidf)
test_baseline_accuracy_SVC_tfidf = accuracy_score(y_test, SVC_pred_test_tfidf)
print('Baseline Support vector machine')
print('with tfidf')
print('train', train_baseline_accuracy_SVC_tfidf)
print('test', test_baseline_accuracy_SVC_tfidf)
print('Baseline SVM - classification report on train')
print(classification_report(y_train, SVC_pred_train_tfidf))
print()
print('Baseline SVM - classification report on test')
print(classification_report(y_test, SVC_pred_test_tfidf))
comment: At the baseline level, i.e. without having tuned the hyperparameters of tf-idf or of the models, the classification results clearly show that SVM performs better. We will try to improve the SVM with a grid search.
grid search
#pipeline = Pipeline([('TVEC', TfidfVectorizer()), ('SVM', LinearSVC())])
#parameters = {
# 'TVEC__max_features': [200000, 250000],
# 'SVM__C':[0.1, 1, 10] }
#CV = GridSearchCV(pipeline, parameters, cv=5, n_jobs= 1)
#CV.fit(X_train.text_preprocessed, y_train)
#print('Best score and parameter combination = ')
#print(CV.best_score_)
#print(CV.best_params_)
#encode text with tf-idf
best_tfidf_vectorizer = TfidfVectorizer(max_features=200000)
%time best_tfidf_train = best_tfidf_vectorizer.fit_transform(X_train.text_preprocessed)
best_tfidf_test = best_tfidf_vectorizer.transform(X_test.text_preprocessed)
%time best_SVC_tfidf = LinearSVC(C=0.01).fit(best_tfidf_train, y_train)
best_SVC_pred_test_tfidf = best_SVC_tfidf.predict(best_tfidf_test)
print('Best SVM - classification report on test')
print(classification_report(y_test, best_SVC_pred_test_tfidf))
#code source: Wagner Cipriano github
def get_new_fig(fn, figsize=[12,12]):
""" Init graphics """
fig1 = plt.figure(fn, figsize)
ax1 = fig1.gca() #Get Current Axis
ax1.cla() # clear existing plot
return fig1, ax1
#
def configcell_text_and_colors(array_df, lin, col, oText, facecolors, posi, fz, fmt, show_null_values=0):
"""
config cell text and colors
and return text elements to add and to dell
@TODO: use fmt
"""
text_add = []; text_del = [];
cell_val = array_df[lin][col]
tot_all = array_df[-1][-1]
per = (float(cell_val) / tot_all) * 100
curr_column = array_df[:,col]
ccl = len(curr_column)
#last line and/or last column
if(col == (ccl - 1)) or (lin == (ccl - 1)):
#tots and percents
if(cell_val != 0):
if(col == ccl - 1) and (lin == ccl - 1):
tot_rig = 0
for i in range(array_df.shape[0] - 1):
tot_rig += array_df[i][i]
per_ok = (float(tot_rig) / cell_val) * 100
elif(col == ccl - 1):
tot_rig = array_df[lin][lin]
per_ok = (float(tot_rig) / cell_val) * 100
elif(lin == ccl - 1):
tot_rig = array_df[col][col]
per_ok = (float(tot_rig) / cell_val) * 100
per_err = 100 - per_ok
else:
per_ok = per_err = 0
per_ok_s = ['%.2f%%'%(per_ok), '100%'] [int(per_ok == 100)]
#text to DEL
text_del.append(oText)
#text to ADD
font_prop = fm.FontProperties(weight='bold', size=fz)
text_kwargs = dict(color='w', ha="center", va="center", gid='sum', fontproperties=font_prop)
lis_txt = ['%d'%(cell_val), per_ok_s, '%.2f%%'%(per_err)]
lis_kwa = [text_kwargs]
dic = text_kwargs.copy(); dic['color'] = 'g'; lis_kwa.append(dic);
dic = text_kwargs.copy(); dic['color'] = 'r'; lis_kwa.append(dic);
lis_pos = [(oText._x, oText._y-0.3), (oText._x, oText._y), (oText._x, oText._y+0.3)]
for i in range(len(lis_txt)):
newText = dict(x=lis_pos[i][0], y=lis_pos[i][1], text=lis_txt[i], kw=lis_kwa[i])
#print 'lin: %s, col: %s, newText: %s' %(lin, col, newText)
text_add.append(newText)
#print '\n'
#set background color for sum cells (last line and last column)
carr = [0.27, 0.30, 0.27, 1.0]
if(col == ccl - 1) and (lin == ccl - 1):
carr = [0.17, 0.20, 0.17, 1.0]
facecolors[posi] = carr
else:
if(per > 0):
txt = '%s\n%.2f%%' %(cell_val, per)
else:
if(show_null_values == 0):
txt = ''
elif(show_null_values == 1):
txt = '0'
else:
txt = '0\n0.0%'
oText.set_text(txt)
#main diagonal
if(col == lin):
#set color of the textin the diagonal to white
oText.set_color('w')
# set background color in the diagonal to blue
facecolors[posi] = [0.35, 0.8, 0.55, 1.0]
else:
oText.set_color('r')
return text_add, text_del
#
def insert_totals(df_cm):
import numpy as np
""" insert total column and line (the last ones) """
sum_col = []
for c in df_cm.columns:
sum_col.append( df_cm[c].sum() )
sum_lin = []
for item_line in df_cm.iterrows():
sum_lin.append( item_line[1].sum() )
df_cm['sum_lin'] = sum_lin
sum_col.append(np.sum(sum_lin))
df_cm.loc['sum_col'] = sum_col
#print ('\ndf_cm:\n', df_cm, '\n\b\n')
#
def pretty_plot_confusion_matrix(df_cm, annot=True, title='Confusion matrix', cmap="Oranges", fmt='.2f', fz=11,
lw=0.5, cbar=False, figsize=[9,9], show_null_values=0, pred_val_axis='y'):
"""
print conf matrix with default layout (like matlab)
params:
df_cm dataframe (pandas) without totals
annot print text in each cell
cmap Oranges,Oranges_r,YlGnBu,Blues,RdBu, ... see:
fz fontsize
lw linewidth
pred_val_axis where to show the prediction values (x or y axis)
'col' or 'x': show predicted values in columns (x axis) instead lines
'lin' or 'y': show predicted values in lines (y axis)
"""
if(pred_val_axis in ('col', 'x')):
xlbl = 'Predicted'
ylbl = 'Actual'
else:
xlbl = 'Actual'
ylbl = 'Predicted'
df_cm = df_cm.T
# create "Total" column
insert_totals(df_cm)
#this is for print allways in the same window
fig, ax1 = get_new_fig('Conf matrix default', figsize)
#thanks for seaborn
ax = sns.heatmap(df_cm, annot=annot, annot_kws={"size": fz}, linewidths=lw, ax=ax1,
cbar=cbar, cmap=cmap, linecolor='w', fmt=fmt)
#set ticklabels rotation
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, fontsize = 10)
ax.set_yticklabels(ax.get_yticklabels(), rotation = 25, fontsize = 10)
# Turn off all the ticks
for t in ax.xaxis.get_major_ticks():
    t.tick1line.set_visible(False)
    t.tick2line.set_visible(False)
for t in ax.yaxis.get_major_ticks():
    t.tick1line.set_visible(False)
    t.tick2line.set_visible(False)
#face colors list
quadmesh = ax.findobj(QuadMesh)[0]
facecolors = quadmesh.get_facecolors()
#iter in text elements
array_df = np.array( df_cm.to_records(index=False).tolist() )
text_add = []; text_del = [];
posi = -1 #from left to right, bottom to top.
for t in ax.collections[0].axes.texts: #ax.texts:
pos = np.array( t.get_position()) - [0.5,0.5]
lin = int(pos[1]); col = int(pos[0]);
posi += 1
#print ('>>> pos: %s, posi: %s, val: %s, txt: %s' %(pos, posi, array_df[lin][col], t.get_text()))
#set text
txt_res = configcell_text_and_colors(array_df, lin, col, t, facecolors, posi, fz, fmt, show_null_values)
text_add.extend(txt_res[0])
text_del.extend(txt_res[1])
#remove the old ones
for item in text_del:
item.remove()
#append the new ones
for item in text_add:
ax.text(item['x'], item['y'], item['text'], **item['kw'])
#titles and legends
ax.set_title(title)
ax.set_xlabel(xlbl)
ax.set_ylabel(ylbl)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.tight_layout() #set layout slim
plt.show()
# col_names = ['star 1', 'star 2', 'star 3', 'star 4', 'star 5']
pretty_plot_confusion_matrix(pd.DataFrame(confusion_matrix(y_test, best_SVC_pred_test_tfidf), columns=col_names, index=col_names))
comments: Text classification between the given stars and the review contents performs well. The overall weighted f1 score of only 65% is pulled down by the middle classes, where most errors occur. Classes 1 and 5 get really good f1 scores (74% and 80%), which means the words used in extremely bad and extremely good reviews are distinguishable. Misclassifications mostly fall between adjacent classes (stars 1 and 2, stars 4 and 5) and rarely between stars 1 and 5 (0.3% and 0.67%).
Does adding emoticon sentiment and polarity improve the model?
We have seen that sentiment analysis through emoticons may give helpful insights.
#we add 2 new columns of binary outputs from emoticons analysis at the initial tf-idf matrix
best_tfidf_emj_train = hstack((best_tfidf_train, np.array(X_train['happy'])[:,None]))
best_tfidf_emj_train = hstack((best_tfidf_emj_train, np.array(X_train['sad'])[:,None]))
best_tfidf_emj_test = hstack((best_tfidf_test, np.array(X_test['happy'])[:,None]))
best_tfidf_emj_test = hstack((best_tfidf_emj_test, np.array(X_test['sad'])[:,None]))
best_SVC_tfidf_2 = LinearSVC(C=0.01).fit(best_tfidf_emj_train, y_train)
best_SVC_pred_test_tfidf_2 = best_SVC_tfidf_2.predict(best_tfidf_emj_test)
print('Best SVM + emojis - classification report on test')
print(classification_report(y_test, best_SVC_pred_test_tfidf_2))
pretty_plot_confusion_matrix(pd.DataFrame(confusion_matrix(y_test, best_SVC_pred_test_tfidf_2), columns=col_names, index=col_names))
comment: Basically, adding emoticons does not significantly improve text classification accuracy.
We can conclude from this research that Yelp does not seem to contain misclassifications: apart from the 6% of errors in the average computation, the overall star per business is meaningful. However, this does not guarantee that the reviews themselves are genuine (non-fake); that could be another interesting research direction.
In the last part of the study, we want to find out to what extent Yelp is a powerful indicator through the significant information it provides. Thus, we will try to predict whether a restaurant has closed or is still open.
_arizona_case_business_info = arizona_case_business_info.copy()
_arizona_case_business_info.head(3)
Throughout the study, we managed to create significant features around time, space and the stars/reviews ratio. However, whether a restaurant closes or stays open may also depend on factors external to Yelp, such as the price of the service, the rent (partly captured by the earlier feature of the distance to the city center), whether a restaurant can extend its revenue (take-out and delivery), and whether it is easily accessible. This information is included in the initial feature called 'attributes'.
price range
price_range = []
for ind, i in enumerate(_arizona_case_business_info.attributes.values):
if i:
if 'RestaurantsPriceRange2' in i:
if i['RestaurantsPriceRange2'] and i['RestaurantsPriceRange2']!='None':
price_range.append(int(i['RestaurantsPriceRange2']))
else:
price_range.append(np.nan)
else:
price_range.append(np.nan)
else:
price_range.append(np.nan)
_arizona_case_business_info['price_range'] = price_range
takeout
def to_binary(sent):
sent=sent.lower()
return 1 if sent=='true' else 0
take_out = []
for ind, i in enumerate(_arizona_case_business_info.attributes.values):
if i:
if 'RestaurantsTakeOut' in i:
if i['RestaurantsTakeOut']:
take_out.append(to_binary(i['RestaurantsTakeOut']))
else:
take_out.append(np.nan)
else:
take_out.append(np.nan)
else:
take_out.append(np.nan)
_arizona_case_business_info['take_out'] = take_out
delivery
delivery = []
for ind, i in enumerate(_arizona_case_business_info.attributes.values):
if i:
if 'RestaurantsDelivery' in i:
if i['RestaurantsDelivery']:
delivery.append(to_binary(i['RestaurantsDelivery']))
else:
delivery.append(np.nan)
else:
delivery.append(np.nan)
else:
delivery.append(np.nan)
_arizona_case_business_info['delivery'] = delivery
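The price range, take-out and delivery loops above repeat the same nested missing-value checks. Under the same assumptions about the attributes dicts, they could be condensed into a single helper (`extract_attribute` is a hypothetical name, not part of the notebook):

```python
import numpy as np

def extract_attribute(attrs, key, to_int=False):
    """Safely pull one key out of a Yelp 'attributes' dict.

    Returns np.nan when the dict is missing, the key is absent,
    or the stored value is empty or the string 'None'.
    """
    if not attrs or key not in attrs:
        return np.nan
    val = attrs[key]
    if not val or val == 'None':
        return np.nan
    # either parse an integer (price range) or map 'true'/'false' to 1/0
    return int(val) if to_int else (1 if val.lower() == 'true' else 0)

# toy attribute dicts mimicking the Yelp schema
rows = [{'RestaurantsTakeOut': 'True'}, None, {'RestaurantsPriceRange2': '2'}]
print([extract_attribute(a, 'RestaurantsTakeOut') for a in rows])
print([extract_attribute(a, 'RestaurantsPriceRange2', to_int=True) for a in rows])
```

Each column would then be a one-line comprehension over `_arizona_case_business_info.attributes.values` instead of a dedicated loop.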
accessibility
from ast import literal_eval #literal_eval is safer than eval for parsing a stringified dict
isCarParking = []
for ind, i in enumerate(_arizona_case_business_info.attributes.values):
if i:
if 'BusinessParking' in i:
dict_ = literal_eval(i['BusinessParking'])
if dict_:
if 'garage' not in dict_ and 'street' not in dict_:
isCarParking.append(np.nan)
elif 'garage' not in dict_ and 'street' in dict_:
isCarParking.append(int(dict_['street']))
elif 'garage' in dict_ and 'street' not in dict_:
isCarParking.append(int(dict_['garage']))
else:
isCarParking.append(int(dict_['garage'] | dict_['street']))
else:
isCarParking.append(np.nan)
else:
isCarParking.append(np.nan)
else:
isCarParking.append(np.nan)
_arizona_case_business_info['isCarParking'] = isCarParking
local or chain The number of restaurants sharing the same name is added. Let's keep in mind that the Yelp activity studied here is about Phoenix and its surroundings, so the number of similar restaurants in such a small area can have a significant impact.
nb_similar = []
occurences = Counter(_arizona_case_business_info.name)
for name in arizona_case_business_info.name:
nb_similar.append(occurences[name])
_arizona_case_business_info['nb_similar'] = nb_similar
target and variables setting
#summary of all variables created
print(_arizona_case_business_info.columns)
#choice of variables for prediction :
target = ['is_open']
#continuous
var_continuous = ['ratio_stars_review',
'ratio_w_funny_reviews',
'ratio_w_useful_reviews',
'dist_to_phoenix_city_center',
'min_dist_tourist_attraction',
'true_std_stars',
]
#discrete or categorical
var_discrete = [ 'review_count',
'overall_star',
'true_iqr_stars',
'true_median_stars',
'difference_of_output_star',
'average_between_reviews_in_h_binned',
'min_between_reviews_in_h',
'max_between_reviews_in_h',
'born_in_yelp_d',
'age_in_yelp_d',
'price_range',
'nb_similar',
'take_out',
'delivery',
'isCarParking',
]
clf_var = var_continuous + var_discrete
print(_arizona_case_business_info[clf_var].isnull().sum())
print('----------------------------------------------')
print('total of obs:', len(_arizona_case_business_info))
comments: with the last new features created, missing values were introduced into the dataframe. Let's see whether those missing values belong to the same businesses.
plt.figure(figsize=(18,5))
sns.heatmap(_arizona_case_business_info[clf_var].isnull().transpose(), cbar=False, xticklabels=False)
plt.show()
comment: The above heat map shows that missing values usually fall on the same observations. Since we have a large enough total of 10631 observations and believe the new features can help in predicting, we drop all rows with missing values.
_arizona_case_business_info.dropna(inplace=True)
_arizona_case_business_info[target + clf_var].describe()
_arizona_case_business_info[target + clf_var].head()
comments: we now have a sample of 6890 rows. Standard deviations and means vary widely between features, so we will scale the features before modelling.
Before applying the model and going deeper into feature analysis, this part gives us a helpful overview of the data. Do we have an imbalanced case? What do the data look like?
count plot
plt.figure(figsize=(7,5))
sns.countplot(x="is_open", data=_arizona_case_business_info, palette='Set1')
plt.title('Proportion of restaurants/bars still open or closed')
plt.xticks([0, 1], ['closed', 'open'])
plt.show()
comments: the closed class is a bit under-represented compared to open, which is not surprising. Undersampling is not required here considering the number of observations. However, we may need to consider a classification model that is not weak against a slightly imbalanced case.
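One built-in guard against mild imbalance in scikit-learn classifiers is the `class_weight` option, which reweights each class by its inverse frequency so the minority class contributes equally to the loss. A minimal sketch on synthetic data (not the Yelp features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic ~80/20 imbalanced problem, standing in for the open/closed split
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# class_weight='balanced' scales sample weights by n_samples / (n_classes * class_count)
clf = LogisticRegression(class_weight='balanced', solver='liblinear').fit(X, y)
print(clf.score(X, y))
```

The same keyword exists on `RandomForestClassifier`, so either model used later could adopt it if the imbalance were stronger.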
t-SNE visualization
t-SNE is used as a dimensionality reduction model for visualization. It is really helpful when the dataset has natural clusters, because t-SNE tends to preserve those clusters when plotting in 2D or 3D, something PCA may not do.
tsne = TSNE(n_components=2, perplexity=60).fit_transform(_arizona_case_business_info[clf_var])
df_tsne = pd.DataFrame({'first_component': tsne[:,0], 'second_component': tsne[:,1]})
df_tsne['is_open'] = _arizona_case_business_info.is_open.values
plt.figure(figsize=(18,10))
sns.scatterplot(x='first_component', y='second_component', data=df_tsne, hue='is_open', palette='Set1')
plt.title('t-SNE plot of the data')
plt.show()
comment: this is a really interesting plot. t-SNE is able to reveal local structure within the data with respect to the open/closed label! By preserving natural clusters, it highlights the actual open vs. closed categories. Hence, it gives a good first feeling about the potential success of the classification model.
Before applying any kind of models, it's important to check correlation among features. Indeed some models might fail against multicollinearity.
Spearman within continuous variables where linearity is not assumed
plt.figure(figsize=(13,7))
sns.heatmap(_arizona_case_business_info[var_continuous].corr(method='spearman'), annot=True, linewidths=.5, cmap="RdBu", cbar=True, vmin=-1, vmax=1)
plt.show()
comments: Regarding continuous data, correlation is often measured with Pearson's or Spearman's coefficient. While Pearson assumes a linear relationship between variables, Spearman does not. In fact, Spearman correlation offers advantages such as capturing monotonic relations and being more robust against outliers. In our case, we do not assume any linear relationship, especially between Haversine distances, so the Spearman correlation is computed. The correlation matrix above shows some correlation between the two review indicators and between the two distances. This seems natural since those variables are derived from the same features. However, it does not reach 0.9, so we do not need to worry about multicollinearity within the features so far.
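The monotonic-vs-linear distinction mentioned above can be checked directly: on a strictly monotonic but non-linear curve, Spearman is exactly 1 while Pearson is not.

```python
import numpy as np
from scipy import stats

# a strictly monotonic but non-linear relationship: y = exp(x)
x = np.linspace(0, 5, 100)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
print(f'Pearson:  {pearson_r:.3f}')   # below 1: captures only linear association
print(f'Spearman: {spearman_r:.3f}')  # exactly 1.0: the ranks agree perfectly
```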
Cramer's V correlation within discrete variables
#source : https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
def cramers_v(x, y):
confusion_matrix = pd.crosstab(x,y)
chi2 = stats.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum().sum()
phi2 = chi2/n
r,k = confusion_matrix.shape
phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
rcorr = r-((r-1)**2)/(n-1)
kcorr = k-((k-1)**2)/(n-1)
return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))
mat_cramer_v = np.zeros((len(var_discrete + target),len(var_discrete + target)))
for iv, v in enumerate(var_discrete + target):
for ivv, vv in enumerate(var_discrete + target):
mat_cramer_v[iv, ivv] = cramers_v(_arizona_case_business_info[v].values, _arizona_case_business_info[vv].values)
plt.figure(figsize=(18,9))
sns.heatmap(pd.DataFrame(mat_cramer_v, index=var_discrete + target, columns=var_discrete + target), annot=True, linewidths=.5, cmap="RdBu", cbar=True, vmin=0, vmax=1)
plt.show()
comments: Regarding correlation between discrete variables, i.e. categorical or ordinal, it can be measured by Goodman and Kruskal's lambda, the Phi coefficient or Cramér's V; the latter two are based on the chi-squared statistic. Cramér's V is computed above and the outputs are shown in the matrix. Unlike Spearman or Pearson, which are bounded between -1 and 1, Cramér's V ranges from 0 to 1, where 0 means no relationship. The output shows that there is no strong correlation within these discrete features.
Point biserial correlation between continuous and discrete
mat_pointbiserialr = np.zeros((len(var_continuous),len(var_discrete + target)))
for idx, i in enumerate(var_continuous):
for jdx, j in enumerate(var_discrete + target):
mat_pointbiserialr[idx,jdx] = stats.pointbiserialr(_arizona_case_business_info[i].values, _arizona_case_business_info[j].values)[0]
plt.figure(figsize=(20,6))
sns.heatmap(pd.DataFrame(mat_pointbiserialr, index=var_continuous, columns=var_discrete + target), annot=True, linewidths=.5, cmap="RdBu", cbar=True, vmin=-1, vmax=1)
plt.title('Point biserial corr between discrete and continuous')
plt.show()
comments: Finally, to close the correlation analysis, it is important to compute the correlation between the continuous and discrete variables. This is done with the point-biserial correlation. The highest correlation is between true_std_stars and true_iqr_stars but remains at 0.79. This is natural, since both features describe the dispersion of the stars feature. Again, there is no extremely high correlation overall, so we do not need to remove features or use a model that handles multicollinearity.
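As a sanity check on toy data (not the Yelp features): the point-biserial coefficient is numerically just Pearson's r computed with one dichotomous variable, which is why `scipy.stats.pointbiserialr` and `pearsonr` agree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
binary = rng.integers(0, 2, size=200)                # dichotomous 0/1 variable
continuous = binary * 1.5 + rng.normal(size=200)     # continuous, shifted by group

r_pb, _ = stats.pointbiserialr(binary, continuous)
r_p, _ = stats.pearsonr(binary, continuous)
# the two coefficients coincide
print(abs(r_pb - r_p) < 1e-10)
```

This also suggests the measure is best suited to truly binary columns such as take_out or delivery; for multi-level discrete features it degrades to an ordinary Pearson correlation.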
t-SNE didn't show any obvious outlier; however, we still compute the Mahalanobis distance to learn a bit more. The Mahalanobis distance is a multi-dimensional distance that takes into account the covariance matrix and the mean vector of the data.
meanV = _arizona_case_business_info[clf_var].mean().values.reshape(1, len(clf_var))
mahalanobis_dist = cdist(_arizona_case_business_info[clf_var], meanV, metric='mahalanobis').flatten()
plt.figure(figsize=(10,8))
plt.hist(mahalanobis_dist, cumulative=True, density=True, histtype='stepfilled', bins=200, alpha=0.5)
plt.title('Cumulative distribution of mahalanobis distances')
plt.xlim(0, 30)
plt.xlabel('distances')
plt.show()
comments: even though t-SNE showed barely any outliers, we can still remove some rows in order to slightly improve model accuracy. The cumulative distribution above shows that around 1% of the data have a Mahalanobis distance greater than 10, so we keep only the rows with a distance of at most 10.
#we remove rows that have a mahalanobis_dist>10
_arizona_case_business_info = _arizona_case_business_info[mahalanobis_dist<=10]
print('new size of the dataset:', _arizona_case_business_info.shape)
When we have a large number of features, it is always interesting to perform a PCA to get an idea of the variance explained by those features.
#scaling before using pca
A = StandardScaler().fit_transform(_arizona_case_business_info[var_continuous])
B = MinMaxScaler().fit_transform(_arizona_case_business_info[var_discrete])
A_B = pd.DataFrame(np.concatenate((A, B), axis=1))
#pca
pca = PCA(n_components=len(clf_var)).fit(A_B)
print(np.cumsum(pca.explained_variance_ratio_))
plt.figure(figsize=(10,8))
plt.fill_between(x=range(0,len(clf_var)), y1=np.cumsum(pca.explained_variance_ratio_), facecolor="gold", alpha=0.2)
plt.title('Cumulative ratio of the variance explained')
plt.xlabel('number of features')
plt.show()
comments: with only 2 components, more than 50% of the variance of the dataset is explained, and 11 components reach 98% of the total explained variance. Using PCA could allow us to improve classification accuracy; however, the features would become less interpretable.
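If the interpretability trade-off were acceptable, the PCA step could be chained in front of the classifier. A minimal sketch on synthetic data (11 components mirror the ~98% figure above; the 21 features mirror the size of clf_var):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the 21-feature scaled matrix used in this study
X, y = make_classification(n_samples=500, n_features=21, random_state=42)

# scale, reduce to 11 principal components, then classify
pca_clf = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=11)),
    ('lr', LogisticRegression(solver='liblinear')),
]).fit(X, y)
print(pca_clf.score(X, y))
```

Wrapping the steps in a Pipeline also guarantees that the scaler and PCA are fit on the training fold only when cross-validating.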
In this part, we will perform the prediction of the binary target open/closed, using the features and preprocessing established so far.
splitting
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(_arizona_case_business_info[clf_var], _arizona_case_business_info[target], test_size=0.3, random_state = 42, shuffle=True, stratify=_arizona_case_business_info[target])
print(X_train_2.shape)
print(X_test_2.shape)
scaling
#standard scaling for continuous
z_scaler = StandardScaler().fit(X_train_2[var_continuous]) # we fit on train
continuous_train_scaled = z_scaler.transform(X_train_2[var_continuous]) # we transform on train
continuous_test_scaled = z_scaler.transform(X_test_2[var_continuous]) # we transform on test
#min max scaling for discrete
mm_scaler = MinMaxScaler().fit(X_train_2[var_discrete])
discrete_train_scaled = mm_scaler.transform(X_train_2[var_discrete]) # we transform on train
discrete_test_scaled = mm_scaler.transform(X_test_2[var_discrete]) # we transform on test
X_train_2_scaled = pd.DataFrame(np.concatenate((continuous_train_scaled, discrete_train_scaled), axis=1), columns=var_continuous + var_discrete, index=X_train_2.index)
X_test_2_scaled = pd.DataFrame(np.concatenate((continuous_test_scaled, discrete_test_scaled), axis=1), columns=var_continuous + var_discrete, index=X_test_2.index)
X_train_2_scaled.describe()
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train_2_scaled, y_train_2.values.flatten())
lr_pred_train = lr.predict(X_train_2_scaled)
lr_pred_test = lr.predict(X_test_2_scaled)
print('Logistic Regression on train: ')
print(classification_report(y_train_2.values, lr_pred_train))
print()
print('Logistic Regression on test: ')
print(classification_report(y_test_2.values, lr_pred_test))
comments: Logistic regression provides really good accuracy, reaching a weighted f1 score of 91%.
rf = RandomForestClassifier(n_estimators=100,
criterion='gini',
max_depth=100,
min_samples_split=4,
min_samples_leaf=2,
max_features='auto',
max_leaf_nodes=None,
random_state=42 )
rf.fit(X_train_2_scaled, y_train_2.values.flatten())
rf_pred_train = rf.predict(X_train_2_scaled)
rf_pred_test = rf.predict(X_test_2_scaled)
print('RF on train: ')
print(classification_report(y_train_2.values, rf_pred_train))
print()
print('RF on test: ')
print(classification_report(y_test_2.values, rf_pred_test))
comment: like Logistic regression, Random Forest also reaches a really high accuracy score.
pretty_plot_confusion_matrix(pd.DataFrame(confusion_matrix(y_test_2, lr_pred_test), columns=['close', 'open'], index=['close', 'open']), title='Confusion matrix - Logistic regression')
pretty_plot_confusion_matrix(pd.DataFrame(confusion_matrix(y_test_2, rf_pred_test), columns=['close', 'open'], index=['close', 'open']), title='Confusion matrix - Random Forest')
comment: errors are mostly false positives for both models (closed restaurants predicted as open).
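The false-positive reading can be verified numerically: for a binary problem, scikit-learn orders the confusion matrix as [[TN, FP], [FN, TP]], so `ravel()` unpacks the four counts directly. A toy example with 0 = closed, 1 = open:

```python
from sklearn.metrics import confusion_matrix

# toy binary labels: 0 = closed, 1 = open
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]

# sklearn orders the 2x2 matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 1 2 1 4
```

Applied to `confusion_matrix(y_test_2, lr_pred_test)`, the same unpacking would give the exact FP and FN counts behind the plots above.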
feature importances
lf_coeff = pd.DataFrame(lr.coef_.ravel(), columns=['lr_coeff'], index=clf_var).sort_values('lr_coeff', ascending=False)
lf_coeff
rf_feature_importances = pd.DataFrame(rf.feature_importances_, index = clf_var, columns=['importance']).sort_values('importance',ascending=False)
rf_feature_importances
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(18, 6))
sns.set_color_codes("muted")
sns.barplot(x=(lf_coeff.lr_coeff / lf_coeff.lr_coeff.sum())*100, y=lf_coeff.index,
label="Logistic regression", color="darkmagenta")
sns.set_color_codes("muted")
sns.barplot(x=(rf_feature_importances.importance / rf_feature_importances.importance.sum())*100, y=rf_feature_importances.index,
label="Random Forest", color="b")
ax.legend(ncol=2, loc="right", frameon=True)
ax.set_xlabel('importance level in %')
ax.set_title('Features importances')
sns.despine(left=True, bottom=True)
comments: both models gave the time features the highest explanatory power. born_in_yelp_d is the number of days between the beginning of Yelp and the first review a business got; it is therefore information about its registration on the platform. age_in_yelp_d is the number of days between the first and last review, i.e. how long the business lasted. Thus, it is not surprising that those features have high explanatory power. In fact, the number of days between the first and last review could introduce bias into the model, since how long a business lasts carries nearly the same information as open vs. closed. We relaunch the models without this feature.
what about without age_in_yelp_d
__X_train_2_scaled = X_train_2_scaled.drop('age_in_yelp_d', axis=1)
__X_test_2_scaled = X_test_2_scaled.drop('age_in_yelp_d', axis=1)
__clf_var = clf_var.copy() #copy so that the original clf_var list is not mutated
__clf_var.remove('age_in_yelp_d')
lr.fit(__X_train_2_scaled, y_train_2.values.flatten())
__lr_pred_test = lr.predict(__X_test_2_scaled)
rf.fit(__X_train_2_scaled, y_train_2.values.flatten())
__rf_pred_test = rf.predict(__X_test_2_scaled)
print('classification report on test set:')
print('Logistic regression')
print(classification_report(y_test_2.values, __lr_pred_test))
print()
print('Random Forest')
print(classification_report(y_test_2.values, __rf_pred_test))
new_importance = pd.DataFrame(lr.coef_.ravel(), columns=['lr_coeff'], index=__clf_var).sort_values('lr_coeff', ascending=False)
new_importance['RF'] = pd.DataFrame(rf.feature_importances_, index = __clf_var, columns=['importance'])
new_importance
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(18, 6))
sns.set_color_codes("muted")
sns.barplot(x=(new_importance.lr_coeff / new_importance.lr_coeff.sum())*100, y=new_importance.index,
label="Logistic regression", color="darkmagenta")
sns.set_color_codes("muted")
sns.barplot(x=(new_importance.RF / new_importance.RF.sum())*100, y=new_importance.index,
label="Random Forest", color="b")
ax.legend(ncol=2, loc="right", frameon=True)
ax.set_title('Features importances in %')
sns.despine(left=True, bottom=True)
comment: without the feature 'age_in_yelp_d' that could have introduced bias into the model, the feature importances change while the weighted f1 score remains good (81% for logistic regression and 85% for random forest). Logistic regression gives the highest coefficients to review count, stars given and review frequencies. Random Forest also uses review count and stars given as important features, but additionally captures a high level of explanation in born_in_yelp_d and nb_similar.
Across models, Yelp is able to make accurate predictions about a restaurant closing, based mostly on review count, stars and review frequency; Yelp can therefore capture how attractive a restaurant is. Furthermore, time insights seem more powerful than location ones. It is likely that the time insights from Yelp also hide other external reasons for closing: in fact, as we saw before, most of the closed restaurants lasted about 2 years and got registered on Yelp in late 2007/2008, which is also the beginning of one of the biggest global financial crises.